Abstract In this paper we describe a novel framework for the discovery
of the topical content of a data corpus, and the tracking of its complex 
structural changes across the temporal dimension. In contrast to previous 
work our model does not impose a prior on the rate at which documents 
are added to the corpus nor does it adopt the Markovian assumption 
which overly restricts the type of changes that the model can capture. 
Our key technical contribution is a framework based on (i) discretization 
of time into epochs, (ii) epoch-wise topic discovery using a hierarchical
Dirichlet process-based model, and (iii) a temporal similarity graph
which allows for the modelling of complex topic changes: emergence and 
disappearance, evolution, and splitting and merging. The power of the 
proposed framework is demonstrated on the medical literature corpus 
concerned with the autism spectrum disorder (ASD) - an increasingly 
important research subject of significant social and healthcare importance.
In addition to the collected ASD literature corpus which we will
make freely available, our contributions also include two free online tools 
we built as aids to ASD researchers. These can be used for semantically
meaningful navigation and searching, as well as knowledge discovery
from this large and rapidly growing corpus of literature. 


1 Introduction 

Autism spectrum disorder (ASD) is a life-long neurodevelopmental disorder
with poorly understood causes on the one hand, and a wide range of potential
treatments supported by little evidence on the other. The disorder is
characterized by severe impairments in social interaction, communication, and
in some cases cognitive abilities. Considering the social and economic burden
of ASD it is unsurprising that it has been attracting an increasing amount of
research attention which has resulted in a rapid growth of the relevant corpus
of literature. Navigating this vast amount of data by conventional, manual
means is difficult and limiting; yet the rapid rise in the diagnosis rate of
ASD demands timely research on its aetiology and treatment. Consequently, the
potential benefit of tools based on novel data-mining and machine learning
techniques is immense [1]. More meaningful ways of visualising or searching
for data could provide invaluable information in clinical and administrative
decision making as well as aid research, while automatic knowledge discovery
would in its own right advance the understanding of the underlying phenomena
(e.g. epidemiological patterns). We describe a novel method which contributes
towards this goal.

More specifically, we describe a general framework for the analysis of medical
literature capable of (i) discovering the underlying topical structure, (ii)
inferring the relationships between different discovered topics, and (iii)
tracking the evolution of topics over time. The proposed framework uses the
hierarchical Dirichlet process (HDP) to extract topics automatically, and then
constructs a similarity graph over them using an inter-topic similarity
measure; topic evolution over time can be inferred from this graph. The
effectiveness of our approach is demonstrated on the specific example of a
large longitudinal data corpus of medical literature on ASD which we collected.
This corpus includes more than 18,000 articles published over the course of 42
years. In addition to the aforementioned technical contributions, our further
contribution is this corpus which will be made public following the publication
of the present paper.

The results we report on the collected ASD literature corpus illustrate the
usefulness of our method and its ability to extract and track over time abstract
topical knowledge, inferring the point at which a certain topic comes into
existence, how it evolves, splits into multiple new topics or merges with the
existing ones, and lastly when it ceases to exist. This is demonstrated on
examples of well-known research directions in the field, making our work the
first to examine the medical literature on ASD using advanced topic modelling
tools. Our additional contributions come in the form of two free online tools
which allow researchers to (i) navigate and search the literature in a
semantically meaningful manner (see http://www.understandingautism.tk), and (ii)
understand the development and relationships between different ideas which
permeate research in the domain of ASD (see http://www.goo.gl/Ws7V64).

2 Previous work 

In this section we review the most relevant previous work on topic modelling. We 
focus our attention first on latent topic models which have dominated the field 
in the last decade, and then on biomedical text mining, given the application 
domain within which our framework is evaluated in Section 4.


2.1 Latent topic models 

An important early topic modelling approach is latent semantic indexing
(LSI) [2] which remains popular. Two notable limitations of LSI are its
inability to deal effectively with polysemy and to produce an explicit
description of the latent space. A probabilistic improvement overcomes these by
explicitly characterizing the latent space with semantic topics, and by
employing a probabilistic generative model that addresses the polysemy problem
[3]. Nevertheless, probabilistic LSI is prone to parameter overfitting caused
by an uncontrolled growth in the number of parameters as the document corpus is
increased. In addition, the necessary assignment of probabilities to documents
is a nontrivial task [5].


The recently proposed latent Dirichlet allocation (LDA) method [3] overcomes
the overfitting problem by adopting a Bayesian framework and a generative
process at the document level. While LDA has quickly become a standard tool for
topic modelling, it too experiences challenges when applied to real-world data.
In particular, because LDA is a parametric model, the number of desired output
topics has to be specified in advance. The HDP model, the nonparametric
counterpart of LDA, was introduced by Teh et al. [5] and addresses this
limitation by using a Dirichlet process (DP) (as opposed to a Dirichlet
distribution) as the prior on topics. Therefore, each document is modelled
using an infinite mixture model, allowing the data to inform the complexity of
the model and infer the number of resulting topics automatically. We discuss
this model in further detail in Section 3.


Temporal topic modelling A notable limitation of most models described in
the existing literature lies in their assumption that the data corpus is static; this
includes those based on LDA mentioned previously, or the hierarchical Dirichlet
process described in detail in the next section. However, in many practical
applications documents are added to the corpus in a temporal manner and their
ordering has significance (non-exchangeability property). As a consequence, the
topical structure of the corpus changes over time. The assumption made by all
previous work, and indeed adopted by us, is that documents are not exchangeable
at large temporal scales but are at short time scales, thus treating the corpus
as temporally locally static.

The existing work on temporal topic modelling can be divided into two groups
of approaches both of which can be based on parametric [6-8] or nonparametric
[9-11] techniques, the former suffering from the limitation that they contain
free parameters which must be set a priori. Methods of the first group discretize
time into epochs, apply a static topic model to each epoch, and by making the
Markovian assumption relate the parameters of each epoch’s topic model to those
of the epochs adjacent to it in time [6,7,9,10]. While the approach we propose
in this paper adopts the idea of time discretization, it diverges from this
group of methods in its other features. In particular, instead of employing the
Markovian assumption we describe a novel structure in the form of a temporal
similarity graph, which gives our method greater flexibility, as described in
detail in the next section. The second group of methods in the literature
regards document time-stamps as observations of a continuous random variable
[8,11]. This assumption severely limits the type of topic changes which can be
described. For example, as opposed to our model, these models are not capable
of describing the evolution of topics, or their splitting and merging, and are
rather constrained to tracking simple topic popularity (rise/fall).


2.2 Biomedical text mining 

Most previous work on text-based knowledge discovery has rather focused on (i) 
the tagging of names of entities such as genes, proteins, and diseases m, (ii) the 
discovery of relationships between different entities e.g. functional associations 
between genes m, or (iii) the extraction of information pertaining to events 
such as gene expression or protein binding [13].

The idea that the medical literature could be mined for new knowledge is
typically attributed to Swanson [15]. For example, by manually examining
medical literature databases he hypothesised that dietary fish oil could be
beneficial for Raynaud’s syndrome patients, which was later confirmed by
experimental evidence. Work that followed sought to develop statistical methods
which would make this process automatic. Most approaches adopted the use of
term frequencies and co-occurrences using dictionaries such as Medical Subject
Headings (MeSH) [16].

Most existing work on biomedical knowledge discovery is based on what may
be described as traditional data mining techniques (neural networks, support
vector machines etc.); comprehensive surveys can be found in [17,14]. The
application of state-of-the-art Bayesian methods in this domain is scarce.
Amongst the notable exceptions is the work by Blei et al. who showed how latent
Dirichlet allocation (LDA) can be used to facilitate the process of hypothesis
generation in the context of genetics [18]. Arnold et al. used a similar
approach to demonstrate that abstract topic space representation is effective
in patient-specific case retrieval [19]. In their later work they introduced a
temporal model which learns topic trends and showed that the inferred topics
and their temporal patterns correlate with valid clinical events and their
sequences [20]. Wu et al. used LDA for gene-drug relationship ranking [21].

3 Proposed framework 

We begin this section by reviewing the relevant theory underlying HDP mixture 
modelling which plays the central role in the proposed framework. Then we turn
our attention to the main technical contribution of our work and explain how 
the HDP is employed to discover the topical content of a literature corpus and 
track its structural changes over time. 

3.1 Hierarchical Dirichlet process mixture models 

The Dirichlet process is a useful prior for mixture modelling which allows a
document collection to accommodate a potentially infinite number of topics.
It is the building block of Bayesian nonparametric methods. A Dirichlet
process [22] $\mathrm{DP}(\gamma, H)$ is defined as a distribution of a random
probability measure $G$ over a measure space $(\Theta, \mathcal{B}, \mu)$, such
that for any finite measurable partition $(A_1, A_2, \ldots, A_r)$ of $\Theta$
the random vector $(G(A_1), \ldots, G(A_r))$ follows a Dirichlet distribution
with parameters $(\gamma H(A_1), \ldots, \gamma H(A_r))$. An alternative view
of the DP emerges from the so-called stick-breaking process which adopts a
constructive approach using a sequence of discrete draws [23]. Specifically, if
$G \sim \mathrm{DP}(\gamma, H)$ then

$$G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k},$$

where $\phi_k \overset{iid}{\sim} H$ and $\boldsymbol{\beta} =
(\beta_k)_{k=1}^{\infty}$ is the vector of weights obtained as
$\beta_k = v_k \prod_{l=1}^{k-1} (1 - v_l)$ with
$v_l \overset{iid}{\sim} \mathrm{Beta}(1, \gamma)$.
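
As an illustration only, the stick-breaking weights can be sketched in Python
with NumPy. The truncation level `num_atoms`, the value of the concentration
parameter and the function name are assumptions made for this sketch; a true DP
draw has infinitely many atoms, so the truncated weights sum to slightly less
than one.

```python
import numpy as np


def stick_breaking_weights(gamma, num_atoms, rng):
    """Truncated stick-breaking weights beta_1..beta_K for concentration gamma."""
    # v_l ~ Beta(1, gamma): the fraction broken off the remaining stick.
    v = rng.beta(1.0, gamma, size=num_atoms)
    # remaining[k] = prod_{l<k} (1 - v_l): stick length left before break k.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining


rng = np.random.default_rng(0)
beta = stick_breaking_weights(gamma=2.0, num_atoms=50, rng=rng)
leftover = 1.0 - beta.sum()  # mass of the discarded tail of the stick
```

A larger concentration parameter spreads the mass over more atoms, which in the
topic modelling setting corresponds to a larger number of effective topics.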

Owing to the discrete nature and infinite dimensionality of its draws, the DP
is a highly useful prior for Bayesian mixture models. By associating different
mixture components with atoms $\phi_k$ of the stick-breaking process, and
assuming $x_i \mid \phi_k \sim F(x_i \mid \phi_k)$ where $F(\cdot)$ is the
likelihood kernel of the mixing components, we can formulate the Dirichlet
process mixture model (DPM). The DPM is suitable for nonparametric clustering
of exchangeable data in a single group, e.g. words in a document, where the DPM
models the underlying structure of the document with a potentially infinite
number of topics. However, many real-world problems are more appropriately
modelled as comprising multiple groups of exchangeable data (e.g. a collection
of documents). In such cases it is usually desirable to model the observations
of different groups jointly, allowing them to share their generative clusters.
This idea is known as sharing statistical strength and is achieved using a
hierarchical structure.

Amongst different ways of linking group-level DPMs, the HDP [5] offers an
interesting solution whereby base measures of document-level DPs are drawn from
another DP. In this way the atoms of the corpus-level DP (i.e. topics in our
case) are shared across the corpus. Formally, if $x = \{x_1, \ldots, x_J\}$ is
a document collection where $x_j = \{x_{j1}, \ldots, x_{jN_j}\}$ is the $j$-th
document comprising $N_j$ words, each document is modelled with a DPM
$G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$ where its DP prior is
further endowed by another DP, $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$.
This is illustrated schematically in Figure 1a. Since the base measure of $G_j$
is drawn from $G_0$, it takes the same support as $G_0$. Also, the parameters
of the group-level mixture components, $\theta_{ji}$, share their values with
the corpus-level DP support on $\{\phi_1, \phi_2, \ldots\}$. Therefore $G_j$
can be equivalently expressed using the stick-breaking process as
$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$ where
$\boldsymbol{\pi}_j \mid \alpha_0, \boldsymbol{\beta} \sim
\mathrm{DP}(\alpha_0, \boldsymbol{\beta})$, with $\boldsymbol{\beta}$ the
corpus-level stick-breaking weights [5]. The posterior for $\theta_{ji}$ has
been shown to follow a Chinese restaurant franchise process which can be used
to develop inference algorithms based on Gibbs sampling [5].


3.2 Modelling topic evolution over time 

In this section we show how the described HDP-based model can be applied to 
the analysis of temporal topic changes in a longitudinal data corpus. We begin 
by dividing the literature corpus by time into multiple epochs. Each epoch is 
then modelled separately using an HDP. Different epochs’ models are inferred 
using the same initial corpus-level base measure and hyperparameters. Hence 
if $n$ is the number of epochs, we obtain $n$ sets of topics
$\boldsymbol{\phi} = \{\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n\}$
where $\boldsymbol{\phi}_t = \{\phi_{1,t}, \ldots, \phi_{K_t,t}\}$ is the set
of topics that describe epoch $t$, and $K_t$ their number (which is inferred
automatically, as described previously). This is illustrated in Figure 1b. In
the next section we describe how, given an inter-topic similarity measure, the
evolution of different topics across epochs can be tracked.


3.3 Measuring topic similarity 

Our goal now is to track changes in the topical structure of a data corpus over
time. The simplest changes of interest include the emergence of new topics,
and the disappearance of others. More subtly, we are also interested in how a
specific topic changes: how it evolves over time in terms of the contributions
of different words it comprises, as well as how it splits into new topics or
merges with the existing ones. Clearly this information can provide valuable
insight into the refinement of ideas and findings in the scientific community,
effected by new research and accumulating evidence.

(a) HDP (b) Proposed

Figure 1: (a) Graphical model representation of HDP. Each box represents one
document whose observed data (words) is shown shaded. Unshaded nodes represent
latent variables. An observed datum $x_{ji}$ is assigned to a latent mixture
component parametrized by $\theta_{ji}$. $\gamma$ and $\alpha$ are the
concentration parameters and $H$ is the corpus-level base measure. (b)
Graphical model representation of the proposed framework. The corpus is
temporally divided into epochs and each epoch modelled using an HDP (outer
boxes).

The key idea behind our approach stems from the observation that while topics
may change significantly over time, by their very nature the change between
successive epochs is limited. Therefore we infer the continuity of a topic in
one epoch by relating it to all topics in the immediately subsequent epoch
which are sufficiently similar to it under some similarity measure. This can be
seen to lead naturally to a similarity graph representation whose nodes
correspond to topics and whose edges link those topics in two epochs which are
related. Formally, the weight of the directed edge that links $\phi_{j,t}$, the
$j$-th topic in epoch $t$, and $\phi_{k,t+1}$ is set equal to
$\rho(\phi_{j,t}, \phi_{k,t+1})$ where $\rho$ is an appropriate similarity
measure. Given that in our HDP-based model each topic is represented by a
probability distribution, suitable similarity measures include the Jaccard
similarity, the Jensen-Shannon divergence, or the $L_2$-norm, for example.
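
Two of the measures named above can be sketched as follows. The text does not
state how the Jaccard similarity is evaluated on probability distributions, so
this sketch assumes it is computed on the sets of the `top_n` most probable
words of each topic; `top_n` and both function names are hypothetical. Note
that the Jensen-Shannon quantity is a divergence, so smaller values mean more
similar topics.

```python
import numpy as np


def jaccard_similarity(topic_a, topic_b, top_n=20):
    """Jaccard overlap of the top_n most probable words of two topics.

    Assumption of this sketch: each topic is a 1-D array of word
    probabilities over a shared vocabulary.
    """
    top_a = set(np.argsort(topic_a)[::-1][:top_n].tolist())
    top_b = set(np.argsort(topic_b)[::-1][:top_n].tolist())
    return len(top_a & top_b) / len(top_a | top_b)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Either function can play the role of $\rho$ when filling in the edge weights of
the similarity graph; a divergence would need to be converted to a similarity
(for instance by subtracting it from one) before thresholding.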

A conceptual illustration of a similarity graph is shown in Figure 2a. It
shows three consecutive time epochs $t-1$, $t$, and $t+1$ and a selection of
topics in these epochs. Graph edge weight, i.e. inter-topic similarity, is
encoded by varying the thickness of the corresponding line connecting two
nodes; a thicker line signifies more similar topics. We use a threshold to
automatically eliminate weak edges, retaining only the edges which correspond
to sufficiently similar topics in adjacent epochs. It can be seen that this
readily allows us to detect the disappearance of a particular topic, the
emergence of new topics, as well as the splitting or merging of different
topics:

Emergence If a node does not have any edges incident to it, the corresponding
topic is taken as having emerged in the associated epoch (e.g. $\phi_{j+2}$ at
time $t$ in Figure 2a).


Disappearance If no edges originate from a node, the corresponding topic is
taken to vanish in the associated epoch (an example at time $t$ is shown in
Figure 2a).


Splitting If more than a single edge originates from a node, the corresponding
topic is understood as being split into multiple topics in the next epoch (e.g.
$\phi_i$ is split into $\phi_j$ and $\phi_{j+1}$ in Figure 2a).


Merging If more than a single edge is incident to a node, the topics of the nodes
from which the edges originate are understood as having merged together
to form a new topic (an example is shown in Figure 2a).
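
The four rules above amount to inspecting the in-degree and out-degree of each
node once weak edges have been removed. A minimal sketch follows, assuming that
topics are identified by unique strings and that similarities are stored in a
dictionary keyed by (earlier topic, later topic) pairs; these data structures,
the function names and the special handling of the first and last epochs are
assumptions of the sketch rather than details given in the text.

```python
from collections import Counter


def threshold_edges(similarity, threshold):
    """Keep only edges whose weight is at least the threshold."""
    return {pair: w for pair, w in similarity.items() if w >= threshold}


def classify_topics(topics_by_epoch, edges):
    """Label every topic using the degree rules of the similarity graph."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1

    first, last = min(topics_by_epoch), max(topics_by_epoch)
    labels = {}
    for epoch, topics in topics_by_epoch.items():
        for topic in topics:
            found = set()
            # Assumption: the first epoch has no predecessor and the last no
            # successor, so missing edges there are not treated as events.
            if in_deg[topic] == 0 and epoch != first:
                found.add("emerged")
            if out_deg[topic] == 0 and epoch != last:
                found.add("disappears")
            if out_deg[topic] > 1:
                found.add("splits")
            if in_deg[topic] > 1:
                found.add("merged")
            labels[topic] = found
    return labels


# Hypothetical toy graph over three epochs.
topics_by_epoch = {0: ["a0", "b0"], 1: ["a1", "b1", "c1"], 2: ["a2"]}
similarity = {
    ("a0", "a1"): 0.8, ("a0", "b1"): 0.6, ("b0", "b1"): 0.1,
    ("a1", "a2"): 0.7, ("b1", "a2"): 0.5, ("c1", "a2"): 0.05,
}
labels = classify_topics(topics_by_epoch, threshold_edges(similarity, 0.3))
```

A topic with exactly one incoming and one outgoing edge receives no label and
is read as simply evolving from one epoch to the next.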


4 Experimental evaluation 

Having introduced the main technical contribution of our work we now illustrate 
its usefulness on the example of ASD literature analysis, and describe additional 
contributions in the form of two free online tools that we developed to aid ASD 
researchers. 

4.1 Data collection 

To the best of our knowledge there are no publicly available corpora of ASD- 
related medical literature. Hence we collected a comprehensive dataset ourselves, 
which will be made public following the acceptance of the present paper. We 
describe our collection methodology and the pre-processing of data we performed 
to extract standard features used for text analysis. 

Raw data collection We used the PubMed search engine that allows users 
to access the US National Library of Medicine for abstracts and references of 
life science and biomedical scholarly articles. We assumed a paper is related to 
ASD if the term “autism” is present in its title or abstract, and collected only 
papers written in English. The earliest publication fitting our criteria is that 
by Kanner [24], and we collected all matching publications up to the final one
indexed by PubMed on 24th July 2014, yielding a corpus of 20,138 publications. 
We discarded the 1,946 which do not have an abstract indexed, leaving a
total of 18,192 papers in our dataset. We used the abstract text to evaluate our
method.

Data pre-processing Following the standard practice in the text processing
literature we applied soft lemmatization to the abstracts in our dataset, using
the freely available WordNet tool [25]. No stemming was performed, to avoid
potential distortion of words which is sometimes effected by heuristic rules
used by stemming algorithms. After lemmatization and the removal of so-called
stop words, we obtained 1.9 million terms in the entire corpus when repetitions
are counted, and 37,278 unique terms. We constructed the vocabulary for our
method by selecting the subset of the most frequent unique terms which explain
90% of the energy of the corpus, which resulted in a 3,738 term vocabulary.
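
One plausible reading of this vocabulary selection step is sketched below. It
assumes that the "energy" of the corpus means the total number of term
occurrences, so that the most frequent terms are added until they account for
the requested share of all occurrences; this interpretation, the function name
and the toy documents are assumptions of the sketch.

```python
from collections import Counter


def select_vocabulary(tokenised_docs, coverage=0.9):
    """Most frequent terms that jointly cover the given share of occurrences."""
    counts = Counter(tok for doc in tokenised_docs for tok in doc)
    total = sum(counts.values())
    vocabulary, covered = [], 0
    for term, freq in counts.most_common():
        if covered >= coverage * total:
            break
        vocabulary.append(term)
        covered += freq
    return vocabulary


# Hypothetical lemmatized, stop-word-free abstracts.
docs = [
    ["autism", "gene", "gene"],
    ["autism", "autism", "child"],
    ["autism", "gene", "rare"],
]
small_vocab = select_vocabulary(docs, coverage=0.75)
```

Because term frequencies in text are heavily skewed, a cut-off of this kind
typically retains only a small fraction of the unique terms.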




4.2 Proposed method implementation 

We divided the 42 year timespan of our data corpus into overlapping five year
epochs, with a two year lag between consecutive epochs, resulting in 18 epochs
in total. The topics of each epoch were then extracted as described in Section
3.2 and their dynamics inferred as per Section 3.3. The number of latent topics
in each epoch is plotted in Figure 2b. Notice the exponential rise in the
number of topics which mirrors the exponential increase in the number of
publications over time in our dataset. This increasing interest in ASD can be
illustrated by the observation that in 2013 there are five times as many
publications as in 2000. For our inter-topic similarity described in Section
3.3 we adopted the use of the well-known Jaccard similarity; this similarity
measure was used to obtain all results reported in this section. Lastly, Gibbs
sampling was used for HDP inference, implemented in Python 2.7, with
hyperparameter resampling as described by Teh et al. [5].
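
The division of a corpus into overlapping epochs can be sketched as below. The
exact boundary convention (for example whether the end year is inclusive) is
not stated in the text, so the half-open windows used here are an assumption,
and the number of epochs this sketch produces for a given timespan need not
match the count reported above. The function names and example years are
hypothetical.

```python
def make_epochs(first_year, last_year, length=5, lag=2):
    """Overlapping (start, end) windows; end is exclusive in this sketch."""
    epochs = []
    start = first_year
    while start + length <= last_year:
        epochs.append((start, start + length))
        start += lag
    return epochs


def assign_to_epochs(doc_years, epochs):
    """For each epoch, the indices of the documents published within it."""
    return [
        [i for i, year in enumerate(doc_years) if start <= year < end]
        for start, end in epochs
    ]


epochs = make_epochs(2000, 2010)
members = assign_to_epochs([2000, 2003, 2008], epochs)
```

Since the windows overlap, a document can belong to more than one epoch, which
smooths the transition between the topic sets of consecutive epochs.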



(a) Topic similarity graph (b) Number of topics per epoch


Figure 2: (a) Conceptual illustration of the proposed similarity graph that models 
topic dynamics over time. A node corresponds to a topic in a specific epoch; edge 
weights are equal to the corresponding topic similarities, (b) As the document 
corpus grows so does the number of topics needed to model its latent structure. 


4.3 Case study 1: ASD and genetics 

While the exact aetiology of ASD is still poorly understood, the existence
of a significant genetic component is beyond doubt [26]. Work on understanding
complex genetic factors affecting the development of autism, which possibly
involve multiple genes which interact with each other and the environment, is a
major theme of research and as such a good case study on which the usefulness
of the proposed method can be illustrated.

Figure 3: Interactive similarity graph analysis tool (see www.goo.gl/Ws7V64).
Word clouds of a few topics are shown for illustration. Nodes and links between
them represent respectively topics in particular epochs and their similarities.

We started by identifying the topic of interest as that with the highest
probability of the terms “gene” or “genetic” conditioned on the topic, and
tracing it back in time to the epoch in which it originated. This led to the
discovery of the relevant topic in the epoch spanning the period 1986-1991.
Figure 4 shows the evolution of this topic from 1992 revealed by our method
(due to space constraints only the most significant parts of the similarity
graph are shown; minor changes to the topic before 1992 are also omitted for
clarity, as indicated by the dotted line in the figure). Each topic is labelled
with its first few dominant terms. The following interpretation of our findings
is readily apparent. Firstly, in the period 1992-1997, the topic is rather
general in nature. Over time it evolves and splits into topics which concern
more specific concepts (recall that such splitting of topics cannot be captured
by any of the existing methods). For example, by the epoch 2002-2007 the single
original topic has evolved and split into four topics which concern:

- the relationship between mutations in the gene MECP2 (essential for normal
functioning of neurons), and mental disorders and epilepsy (it is estimated
that one third of ASD individuals also have epilepsy),

- gene alterations, for example the duplication of 15q11-13 and deletion of
16p11.2, both of which are associated with ASD,

- genetic linkage association analysis and heritability of autism, and

- observational work on autistic twins and probands with siblings on the
spectrum.

Our framework also allows us to look ‘back’ in time. For example, by examining
the topics that the 1992 genetics topic originates from we discovered that the
topic evolved from the early concept of “infantile ASD ” m- 
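
The selection of the topic of interest and its tracing back in time, as used in
this case study, can be sketched as follows. The sketch assumes that topics are
stored as rows of a topic-word probability matrix and that, when a topic has
several predecessors, the most similar one is followed; both the data layout
and this tie-breaking rule are assumptions, as are the function names and toy
values.

```python
import numpy as np


def find_seed_topic(topic_word, vocabulary, query_terms):
    """Index of the topic with the highest total probability of the terms.

    topic_word: (num_topics, vocab_size) array of p(word | topic).
    """
    columns = [vocabulary.index(t) for t in query_terms if t in vocabulary]
    return int(np.argmax(topic_word[:, columns].sum(axis=1)))


def trace_back(topic, predecessors):
    """Follow the strongest incoming edge back to the originating epoch.

    predecessors: dict mapping a topic to a list of
    (earlier_topic, similarity) pairs that survived thresholding.
    """
    lineage = [topic]
    while predecessors.get(topic):
        topic = max(predecessors[topic], key=lambda pair: pair[1])[0]
        lineage.append(topic)
    return lineage


vocabulary = ["gene", "genetic", "vaccine", "child"]
topic_word = np.array([
    [0.10, 0.10, 0.60, 0.20],
    [0.40, 0.30, 0.10, 0.20],
    [0.25, 0.25, 0.25, 0.25],
])
seed = find_seed_topic(topic_word, vocabulary, ["gene", "genetic"])
lineage = trace_back("t3", {"t3": [("t2a", 0.4), ("t2b", 0.7)],
                            "t2b": [("t1", 0.5)]})
```

The last element of the returned lineage is the earliest topic reachable from
the seed, i.e. the epoch in which the theme first appears in the graph.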

Figure 4: Dynamics of the topic most closely associated with the concept of
“genetics”. A few dominant words are shown for each topic (shaded boxes). 

4.4 Case study 2: ASD and vaccination 

For our second case study we chose to examine research on the relationship
between ASD development and vaccination. This subject has attracted much
attention in the research community, in the media, and among the general
public. The controversy was created with the publication of the work by
Wakefield [27] which reported epidemiological findings linking MMR vaccination
and the development of autism and colitis. Despite the full retraction of the
article following the discovery that it was fraudulent, and numerous subsequent
studies which failed to show the claimed link, a significant portion of the
general public remains concerned with the issue.

As in the previous example, we began by identifying the topic with the
highest probability of the terms “vaccine” and “vaccination” conditioned on the
topic, and tracing it back to the epoch in which it first emerged. Again, a
single topic was readily identified, in the epoch spanning the period
1996-2001. Notice that this is consistent with the publication date of the
first relevant publication by Wakefield [27]. The evolution of the topic is
illustrated in Figure 5 in the same way as in the previous section. It can be
seen that the original topic concerned the subjects initially brought to
attention such as “measles”, “vaccine”, and “autism”. In the subsequent epoch,
when the original claim was still thought to have credibility, the topic
evolves and splits into numerous others mirroring research directions taken by
various researchers. Following this period and the revelations of its
fraudulence, the topic assumes mainly single-threaded evolution, at times
incorporating various originally separate ideas. For example, observe the
independent emergence of the term “mercury”. Though initially unrelated to it,
this topic merges with the topic that concerns vaccination, which can be
explained by the widely publicized thiomersal (vaccine preservative)
controversy (again note that such merging of topics cannot be captured by the
existing methods). Although rejected by the medical community due to a lack of
evidence, this topic can be seen as persisting to date.

Figure 5: Dynamics of the topic most closely associated with the concept of 
“vaccination”. Notwithstanding the rejection of any link between vaccination 
and autism, this topic remains active albeit in a form which evolved over time. 


4.5 Topic browser 

A topic model can be seen as a dimensionality reduction framework that maps
documents into a topic space. This transformation of data can provide powerful
insight and allow for the browsing of documents in a more subject-specific,
semantic manner. For example, by describing documents in the topic space,
documents most related to a particular topic of interest can be readily
identified and retrieved. To provide this functionality to the research
community interested in ASD we used the framework described in this paper to
model the entire literature corpus we collected, and built a website to
facilitate free and ready use of our model and data. Researchers can use our
online tool to browse topics, annotate them, and navigate through publications
by topic. The website is available at http://www.understandingautism.tk.
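
The kind of topic-based retrieval described above can be sketched as a ranking
of documents by their weight on a chosen topic. This is not the implementation
behind the website; the matrix layout, the function name and the toy values are
assumptions made for illustration.

```python
import numpy as np


def top_documents(doc_topic, topic, k=10):
    """Indices of the k documents with the largest proportion of a topic.

    doc_topic: (num_docs, num_topics) array of per-document topic
    proportions, e.g. as estimated by a fitted topic model.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    order = np.argsort(doc_topic[:, topic])[::-1]
    return order[:k].tolist()


# Hypothetical proportions for three documents and two topics.
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
ranked = top_documents(doc_topic, topic=1, k=2)
```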


5 Conclusions 

We described a novel framework for temporal modelling of the topical structure
of a longitudinal document corpus. Our approach consists of discretizing time
into overlapping epochs, modelling the static topic structure within each epoch
using an HDP, and tracking the evolution of topics over time using an
inter-topic similarity measure. The resultant similarity graph captures
relationships between topics in different epochs and allows for the automatic
inference of the time of emergence and disappearance of topics, their evolution
over time, and their merging and splitting. The power of the proposed general
framework was demonstrated on the example of ASD-related medical literature. In
two case studies which concern two important research issues in ASD literature
we demonstrated that our method extracts meaningful topics and their temporal
changes. A novel data corpus and free online tools are made available to
researchers.